Comparison between vision transformers and convolutional neural networks to predict non-small lung cancer recurrence

Non-Small cell lung cancer (NSCLC) is one of the most dangerous cancers, with 85% of all new lung cancer diagnoses and a 30–55% of recurrence rate after surgery. Thus, an accurate prediction of recurrence risk in NSCLC patients during diagnosis could be essential to drive targeted therapies preventing either overtreatment or undertreatment of cancer patients. The radiomic analysis of CT images has already shown great potential in solving this task; specifically, Convolutional Neural Networks (CNNs) have already been proposed providing good performances. Recently, Vision Transformers (ViTs) have been introduced, reaching comparable and even better performances than traditional CNNs in image classification. The aim of the proposed paper was to compare the performances of different state-of-the-art deep learning algorithms to predict cancer recurrence in NSCLC patients. In this work, using a public database of 144 patients, we implemented a transfer learning approach, involving different Transformers architectures like pre-trained ViTs, pre-trained Pyramid Vision Transformers, and pre-trained Swin Transformers to predict the recurrence of NSCLC patients from CT images, comparing their performances with state-of-the-art CNNs. Although, the best performances in this study are reached via CNNs with AUC, Accuracy, Sensitivity, Specificity, and Precision equal to 0.91, 0.89, 0.85, 0.90, and 0.78, respectively, Transformer architectures reach comparable ones with AUC, Accuracy, Sensitivity, Specificity, and Precision equal to 0.90, 0.86, 0.81, 0.89, and 0.75, respectively. Based on our preliminary experimental results, it appears that Transformers architectures do not add improvements in terms of predictive performance to the addressed problem.

Several works have been proposed on the prediction of recurrence-free survival and overall survival in NSCLC patients.However, the state-of-the-art is lacking of models designed for the early prediction of disease recurrence.Furthermore, although all proposed models show encouraging results, they are still not suitable for a clinical application, even when they involve genomic-based models which are expensive and time-consuming procedures.In recent years, artificial intelligence has already demonstrated its potential in defining predictive and prognostic models.Specifically, the predictive power of radiomic features extracted from biomedical images is now well established in the scientific community [4][5][6][7][8] .
Recently, radiomics via Convolutional Neural Networks (CNNs) has been extensively used showing strong potential [5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20] .CNNs can be of two types: custom or pre-trained.In the former, scientists build their own network which is then trained to execute a specific task; in the latter case, a transfer-learning approach is used [15][16][17][18][19][20] .Networks are first trained on millions of images of different classes (e.g., ImageNet) in recognizing specific patterns like edges, dots, color gradients, shapes, etc. 21 .After that, this gained knowledge is transferred to the specific set of images to study.In this work, we adopted only the transfer learning approach.Typically, CNNs consist of several layers of convolutions and max pooling.When applied to images, the bottom layers (close to the input layer) focus on local simple features like edges, dots, and color gradients; higher layers, instead, combine the previous features into more complex ones and can be used to train Machine Learning models.
However, CNNs require high computational resources; second, they focus more on the entire image instead of its portions which could contain the lesion 22,23 .
In this scenario, in light of innovative algorithms proposed in the literature, the aim of our work was to compare the performances of different state-of-the-art deep learning algorithms to predict disease recurrence in NSCLC patients.To the best of our knowledge, the state-of-the-art lacks a comparative study on the classification performances obtained by these two architectural families in relation to the problem of disease recurrence prediction evaluated on the same reference dataset.This information would allow us to lay the foundations for future studies aimed at defining and validating an accurate model of personalized medicine.Therefore, in this preliminary work, we used various Transformer architectures to predict NSCLC recurrence 14,[49][50][51][52] .We used a public database of CT images of 144 NCSLC patients for recurrence classification comparing the performances of ViTs and CNNs 53 .The paper is organized as follows: in Section "Results", Materials and Methods, we introduce the database of patients and the network architectures; then, in Sections "Discussion and conclusion" and "Materials and methods", Results and Discussion, we present the results of our transfer-learning-based model, discussing their performances.
As additional result, Fig. 5 shows a histogram of the validation loss values, averaged over all the epochs, folds and rounds of cross-validation, for ViTs, PVTs, Swins and CNNs.ViTb_16, PVT-B1, Swin-tiny, and InceptionV3 show the lowest validation loss in the histogram within their family.The best trade-off between the performances achieved and the loss valued was reached by InceptionV3.

Discussion and conclusion
The a m of the study was to evaluate the performances of different deep learning algorithms for predicting recurrence in NSCLC patients by analyzing baseline CT.Our experimental results showed that ViTb_16, has higher performances, reaching an AUC and Accuracy values of 0.84 ± 0.04 and 0.83 ± 0.05, respectively, against ViTb_32, values equal to 0.67 ± 0.05 and 0.64 ± 0.08 respectively due to their different architectures.Indeed, ViTb_16 decomposes the input images into patches of size 16 × 16 pixels, while ViTb_32 into patches of size 32 × 32 pixels.Therefore, if the patch size is smaller, the transformer encoder's attention would be higher, bringing to a better   classification.As regards the Swin cases, both Swin-tiny and Swin-small are comparable (AUC = 0.82 ± 0.04 and 0.80 ± 0.07; Accuracy = 0.79 ± 0.04 and 0.77 ± 0.04 respectively).The best performances among the considered Transformers techniques are reached with PVT-B1 with the AUC and Accuracy value of 0.90 ± 0.04 and 0.86 ± 0.04 respectively.These better performances, among all considered Transformers, could depend on the PVT overlapping patch embedding mechanism allowing the Transformer to extract more information from the CT image than ViTs and Swins 29 .In the end, the best performances of this study are reached via pre-trained CNN InceptionV3 with AUC and Accuracy equal to 0.91 ± 0.03 and 0.89 ± 0.04 respectively.Even if CNNs perform best, the considered Transformers ViTs, PVTs and Swins still reach high and comparable performances.
As regards the topic of NSCLC classification, we scanned the literature and, to the best of our knowledge, we identified the state-of-the-art works which mainly use clinical features or radiomic ones.The latter can be further split into handcrafted features or extracted via CNNs 14, 49-52 .To the best of our knowledge, we use pre-trained ViTs, PVTs, and Swins for the first time, for the specific task of NSCLC classification.Table 1 summarizes the principal results proposed in the state-of-the-art according to the topic of our clinical task.
S. Hindocha et al. predicted recurrence, recurrence-free survival, and overall survival of NSCLC patients, employing only clinical features from a cohort of 657 patients.As regards the task of recurrence prediction, an AUC equals to 0.69 was reached 51 .In the work of Wang et al., for example, CT images from a cohort of 157 NSCLC patients were analyzed using only handcrafted-radiomic features reaching an a ccuracy equals to 0.85 52 .
As regards NSCLC recurrence radiomic studies based on deep learning models, we mention the works of Aonpong et al., Kim et al., and Bove et al. 14,49,50 .In the former, Authors used a subsample of our same radiogenomic database to predict the NSCLC recurrence implementing a genotype-guided radiomic model focusing on a smaller cohort of 88 patients 50 .Using various state-of-the-art CNNs, gene expression data were extracted from CT images achieving an AUC equals to 0.77, and a ccuracy equals to 0.83.In the second one, Kim et al. 49 built various ensemble-based prediction models using a database of 326 patients including our one.Clinical data, handcrafted radiomic features, and deep learning radiomic ones were considered and combined with each other.The best performances combining all together were AUC equals to 0.77, and Accuracy equals to 0.73.Finally, in the work of Bove et al. a transfer learning approach was implemented extracting radiomic features from the cropped CT images, around the tumor area, of our same NSCLC radiogenomic dataset 53 via pre-trained CNNs,  www.nature.com/scientificreports/reducing the number of radiomic features and combining them with the clinical data of the database.The best reached performances consisted of AUC and Accuracy equal to 0.83 and 0.79 respectively 14 .Considering all the results, in our model pre-trained CNN InceptionV3 seems to outperform the state-ofthe-art works on NSCLC recurrence classification topic.
We would like to underline that the comparison with the state of the art is purely naïve.Unfortunately, the works proposed in the literature on the same clinical task have often been developed starting from private datasets.Even when they use the same public dataset to which we referred, the authors integrated the public data with private data (as in the work presented by Kim et al. 49 ), without then differentiating the results obtained, or selected a subset of data according to certain criteria, which could be compatible with the objective of our work (as for the work presented by Aonpong et al. 50).Therefore, it is difficult to make objective comparisons on the same dataset.
However, our model still suffers from some limitations.Indeed, although a data augmentation technique has been used to reinforce the training of the last layers of the pre-trained networks used, the obtained performances are strongly influenced by the retrospective nature and small dimension of the dataset.Specifically, the model needs to be validated in a more robust manner also using an external validation set, preferably referring to a sample of private data, although the use of a public database as is known allows an objective comparison of the proposed methods.Therefore, for the future, we intend to collect a larger database of NSCLC patients to validate and optimize the proposed models; moreover, we will also evaluate other public dataset to test the obtained results.Another possible future direction in the research would include a further investigation of more Transformer architectures and their correspondent performances.Moreover, further studies could include both combined deep radiomic and clinical features to train suitable Machine Learning classifiers to predict NSCLC recurrence after years with the help of the Explainable Artificial Intelligence (XAI) to detect the most relevant and decisive features for the prediction 54,55 .

Experimental dataset
In our work, we used a public radiogenomics dataset of NSCLC available in the Cancer Imaging Archive (TCIA) 53 .The public database consisted of 211 subjects divided into two sub-cohorts: (1) The R01 cohort with 162 patients (38  We chose to focus only on the (1) sub-cohort R01 because they had both tumor segmentation binary masks and the axial CT available.Among the 162 patients of cohort R01, the tumor segmentation mask was not available for 18 patients, so the final number of patients involved in this study is equal to 144, of which 40 (27.78%) with a recurrence event within eight years from the first diagnosis.For each patient, a CT image in DICOM format was available and was acquired by preoperative CT scans with a thickness of 0.625-3 mm and an X-ray tube current at 124-699 mA at 80-140 KVp.On the other hand, the related segmentations were defined on the axial CT image series by thoracic radiologists with more than five years of experience and adjusted using ePAD software 53 .
In this study, the clinical data were not used, and the recurrence feature (yes = 1, no = 0) was chosen as a label for image classification.
For each patient, we first detected the segmentation mask with the largest tumour area and found the corresponding CT slide for the analysis as shown in Fig. 6.

ViTs, PVTs, Swins and CNNs architectures
After detecting the CTs with the largest tumor area, we adopted a deep learning transfer-learning approach involving pre-trained ViTs, PVTs, Swins, and CNNs.All the analysis steps were performed using Python programming language with Tensorflow-Keras 56,57 .
First, the original CT image pixels were normalized in the range [0;1] and then reshaped to the specific input size of the Transformers and CNNs.Then, the whole pre-processed images became the input for the various models.
The usual architecture of state-of-the-art CNNs, shown in Fig. 7, consists of three key elements represented by the convolutional layers, the pooling layers, and the fully connected.Once the CNN receives an input image suitably pre-processed, the convolutional layers are the ones dedicated to learning features from the input images, instead, the max-pooling layers are responsible for the reduction of the size of feature maps.At the end of the CNN, fully connected layers are added in a stacked way which, via a specific function (e.g., SoftMax or Sigmoid), provides classification [10][11][12][13][14][15][16][17][18][19][20] .
The architecture of the Transformers, shown in Fig. 8 according to the architecture of A. Dosovitskiy et al., is quite different from traditional CNNs 24 .ViTs derive from the original transformer model used in the natural language processing (NLP), where the input object consists of one-dimensional word tokens.The input images, of typical size 224 × 224 pixels, of height H, width W, and channels C are divided into smaller patches with number N = HW/P 2 being P × P the pixel size of the input image.To perform the classification task, ViTs are equipped with an encoder that receives the sequence of embedded picture patches, together with positional data, and a learnable class embedding suspended sequence.The latter is sent to the classification head coupled to the output of the encoder.Therefore, the data sequence is the following: • Original images are resized to size e.g., 224 × 224, and normalized between [0;1].They are then decomposed in the N patches.• The obtained patches are then flattened obtaining a linear patch projection.
• Learnable embeddings with patch projections are then concatenated.The positional embedding marks the order of the single patch in the sequence.• The output of the transformer encoder is sent to a Multilayer perceptron head (MLP) that with additional layers of this work, e.g., a Flatten layer, a Batch Normalization layer, a Dense layer with 64 units, another Batch Normalization layer, and the final Dense layer with sigmoid function shown in red dashed box of Fig. 3, provide classification.
In this study, we performed different experiments using two ViT models: a base model with 16 × 16 image patch size (ViTb_16) and a base model with 32 × 32 image patch size (ViTb_32) both consisting of 12 hidden Table 2. Table of the clinical features of the adopted dataset and their distributions."Nan" means "Not A Number" if the data is missing in the database, "abs" stands for "absolute value".layers [22][23][24] .PVTs represent a variant of the original ViTs and as stated by their name, they possess a columnar pyramid structure similar to traditional CNNs 29,30 .In this work, we adopted the improved version PVTs v2, from Wang et al. ( 2022), which introduced the linear complexity attention layer, the overlapping patch embedding and convolutional feed-forward network orthogonal to original PVTs.From now on, throughout the text, for the sake of simplicity, we will use the term PVT to indicate PVT v2 architecture of Wang et al. 29,30 .We considered two models of this family: PVT-B0 and PVT-B1.Both consist of four stages characterized by C i channel number of the output of stage i, R i reduction ratio, N i head number, E i expansion ratio of the feed forward layer, and L i number of encoder layers for i = 1-4 hyperparameters.For both L 1 -L 4 equals 2 whereas C i , for i = 1-4, of PVT-B1 is double of the correspondent PVT-B0 29,30 .The Swin Transformer is another Transformer architecture 27 .As the name states, Shifted Window, the key idea of this type of Transformer is to build a hierarchy starting from smallsized patches and gradually merging neighbouring patches into deep Tranformer layers.Between a self-attention layer and the next one, there is a window shift resulting in a new one.We adopted two types of this architecture consisting of the Swin-tiny and Swin-small which provided the best performances.The hyper-parameters of these types of Swins are represented by the channel number C of hidden layers in the first stage being C = 96 for both the Swin-tiny and small and the layer numbers being {2,2,6,2} ({2,2,18,2}) for the tiny one (small) 27 .For all the analyzed Transformers the ideal image size has been set to 224 × 224 pixels.In Python Tensorflow-Keras, ResNet50 requires input images of size 224 × 224 pixels with 177 total layers.Differently, InceptionV3 needs input images of 299 × 299 pixels with 313 total layers.In the end, DenseNet201 accepts input images of 224 × 224 pixels size with a total of 709 layers 56,57 .

Learning model
We built transfer learning models using pre-trained ViTs, PVTs, Swins, and CNNs on the ImageNet natural image dataset to train the dataset of NCSLC patients to predict the recurrence event [56][57][58] .The application of transfer learning to ViT, PVT, and Swin architectures consisted in replacing the last layer with the following layers: a flattening layer plus a batch normalization, one dense layer with Gelu activation function followed by another batch normalization, and the final dense layer as classifier with a sigmoid activation function.The red dashed box in Fig. 8 shows the added layers.This scheme was also adopted for CNNs replacing the Gelu with the Relu activation function for the added dense layer.These new networks were then trained for the image classification task.We implemented a stratified tenfold cross-validation in 20 external rounds on the entire dataset of 144 patients.In each fold of the cross-validation, 90% of the dataset corresponding to 130 elements is used as a training set, whereas the remaining 10%, corresponding to 14 elements, is used as the test set.
In this study, all the models were trained for 30 epochs in each fold of the cross-validation with batch size equal to 10 elements.Adam optimizer with an initial learning rate of 10-4 was used to optimize the weights of the network.To handle the imbalancing of the dataset, a sigmoid focal cross-entropy was used as loss function with balancing factor α and modulating factor β equal 0.25 and 2.0 respectively 59 .Considering our database is relatively small, to make our analysis more robust we implemented a data augmentation process, in addition to the transfer-learning approach, using three built-in Keras transformations such as Random Flip, Random Rotation, and Random Contrast 57 .This data augmentation was added as an additional layer in the models.
After the training phase, the model was used to predict the probability scores and then used to compute the performances via the Scikit-learn library functions 58 .Performances of classification of NSCLC recurrence for pre-trained ViTs, PVTs, Swins and CNNs have been evaluated in terms of the Area Under the Curve (AUC), Accuracy, Sensitivity, Specificity, and Precision.These metrics are computed in each of the 20 rounds of the stratified cross-validation so, in the end, the final performances, of the specific model, are evaluated as an average of all the 20 values with their corresponding standard deviation.To better balance these metrics, a Youden index test was performed 60 .

Figure 1 .
Figure 1.Radar plots of the performances AUC, Accuracy (Acc), Sensitivity (Sens), Specificity (Spe), and Precision (Pre) of ViTb_32 (a) and ViTb_16 (b).For each metric, the mean value, among all the cross-validation 20 rounds, is shown with its standard deviation.

Figure 3 .
Figure 3. Radar plots of the performances AUC, Accuracy (Acc), Sensitivity (Sens), Specificity (Spe), and Precision (Pre) of Swin-tiny (a) and Swin-small (b).For each metric, the mean value, among all the crossvalidation 20 rounds, is shown with its standard deviation.

Figure 5 .
Figure 5. Example of the training loss function and validation loss plots as a function of the 30 epochs of training (a).Histogram of the validation loss values, averaged over all the epochs, rounds, and folds of the crossvalidation for ViTs, PVTs, Swins, and CNNs (b).

Figure 6 .
Figure 6.An example of a binary Mask with the largest tumor area and its corresponding CT are shown.The yellow lines mark the tumor area in the CT.For each patient, we detected this correspondence, and the CTs, suitably rescaled in the range [0;1] and with a specific input size, were then used as input for the ViTs, PVTs, Swins, and CNNs.

Figure 7 .
Figure 7.Typical architecture of a CNN.It takes the input image, suitably resized, and elaborates it through a series of internal layers consisting of convolutional ones, max-pooling layers, and fully connected layers until final classification 10-20 .

Table 1 .
Table of the state-of-the-art performances achieved in previous works about NSCLC recurrence prediction.